adding gujarati vocabulry dec 4 #1811
doctr/datasets/vocabs.py
Outdated
@@ -22,6 +22,13 @@
"hindi_letters": "अआइईउऊऋॠऌॡएऐओऔअंअःकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसह",
"hindi_digits": "०१२३४५६७८९",
"hindi_punctuation": "।,?!:्ॐ॰॥॰",
"gujarati_vowels": "અઆઇઈઉઊઋએઐઓઔઅંઅઃ ",
"gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
"gujarati_diacritics_consonants":"""કકાકિકીકુકૂકૃકેકૈકોકૌકંકઃખખાખિખીખુખૂખૃખેખૈખોખૌખંખઃગગાગિગીગુગૂગૃગેગૈગોગૌગંગઃઘઘાઘિઘીઘુઘૂઘૃઘેઘૈઘોઘૌઘંઘઃઙઙાઙિઙીઙુઙૂઙૃઙેઙૈઙોઙૌઙંઙઃચચાચિચીચુચૂચૃચેચૈચોચૌચંચઃછછાછિછીછુછૂછૃછેછૈછોછૌછંછઃ |
@sarjil77 I tested a bit with your added vocab, which raised some issues because we can't encode it. If I understood it correctly, a leading letter combined with a diacritic (the marks shown on a dotted circle), for example કૌ, renders as one character but is counted as 2 characters programmatically. Is there any way to make these strings Unicode-conformant, so that in the end each character in an image corresponds to exactly 1 encoded character?
If I filter your diacritics, I get the following:
ઃકખગઘઙચછજઝઞટઠડઢણતથદધનપફબભમયરલવશષાિીુૂૃેૈોૌ્
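A minimal sketch of the mismatch described above, using only the standard library. The helper name `describe` is hypothetical and only for illustration; the cluster is written with escapes so the code points are unambiguous.

```python
import unicodedata

# Gujarati letter KA (U+0A95) followed by vowel sign AU (U+0ACC).
# On screen this renders as one glyph cluster.
cluster = "\u0A95\u0ACC"


def describe(text):
    """Return (code point, general category, Unicode name) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
        for ch in text
    ]


# len() counts Unicode code points, not the glyphs a reader sees.
print(len(cluster))  # 2

for entry in describe(cluster):
    print(entry)
```

Categories starting with `M` are combining marks; these are the symbols that attach to a base letter and end up as separate vocab entries.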
By the way, in a multiline (triple-quoted) string each line needs to end with \
otherwise the line break is counted as a character of the string.
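A short illustration of that point in plain Python. The strings are placeholders, not the real vocab; the last variant (adjacent literals inside parentheses) is an alternative that is not mentioned in the thread.

```python
# A line break inside a triple-quoted string is part of the string.
with_newline = """abc
def"""

# A trailing backslash continues the line, so no newline is stored.
continued = """abc\
def"""

# Alternative: adjacent string literals are concatenated at compile time.
adjacent = (
    "abc"
    "def"
)

print(repr(with_newline))  # 'abc\ndef'
print(repr(continued))     # 'abcdef'
print(repr(adjacent))      # 'abcdef'
```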
@sarjil77 Something like this:
"gujarati_letters": "તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ",
"gujarati_digits":"૦૧૨૩૪૫૬૭૮૯",
"gujarati_punctuation": "૰ઽ◌ંઃ॥ૐ" + "૱",
length: 103
all chars: તગખઢરજયશઆઐઊૂેપફુ્ઓૈાથીડૃદઠવનલષકિઅભઘઉઔઝઙઇઞઈધૌછટચબોમએણઋ૦૧૨૩૪૫૬૭૮૯૰ઽ◌ંઃ॥ૐ૱!"#$%&'()*+,-./:;<=>?@[]^_`{|}~
I'm not sure about this, though 😅
This is what I get when I deduplicate your vocab in Python:
each single diacritic (the addition to a char) is counted as a standalone symbol.
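One way to reproduce this kind of deduplication is sketched below. It is an assumption about the approach, not the exact code used in the thread: `deduplicate` and `split_marks` are hypothetical helpers, and this variant keeps the first-seen order.

```python
import unicodedata


def deduplicate(text):
    """Drop repeated code points while keeping first-seen order."""
    return "".join(dict.fromkeys(text))


def split_marks(text):
    """Separate base letters from combining marks (general category M*)."""
    letters = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )
    marks = "".join(
        ch for ch in text if unicodedata.category(ch).startswith("M")
    )
    return letters, marks


# KA, KA+AA, KA+I, KHA, KHA+AA: five visible clusters, eight code points.
sample = "\u0A95\u0A95\u0ABE\u0A95\u0ABF\u0A96\u0A96\u0ABE"

unique = deduplicate(sample)
letters, marks = split_marks(unique)
print(len(sample), len(unique))  # 8 4
print(letters, len(marks))
```

After deduplication only the two consonants and the two vowel signs remain, which matches the observation that every diacritic becomes its own vocab symbol.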
Thanks @felixdittrich92, noted. I am not sure right now, but I will look further into this. :)
Hello @felixdittrich92, you are right: a sequence like "ફુ્" (a letter with its diacritics) is counted as more than one character; it consists of 3 code points and takes 9 bytes in UTF-8. To treat a letter and its diacritics as a single character, we could try NFC (Normalization Form C), which composes a base character and its combining marks into a single code point where a precomposed form exists. For this sequence it leaves the code points and the byte representation unchanged.
For example:
import unicodedata

txt = "ફુ્"
encoded_string = txt.encode("utf-8")  # UTF-8 bytes of the string
normalized_text = unicodedata.normalize("NFC", txt)
print("encoded string is:", encoded_string)
print(f"the length of encoded string is: {len(encoded_string)}")
print("normalized_text is:", normalized_text)
# len() on a str counts code points, not bytes
print(f"the length of normalized text is: {len(normalized_text)}")
Output:
encoded string is: b'\xe0\xaa\xab\xe0\xab\x81\xe0\xab\x8d'
the length of encoded string is: 9
normalized_text is: ફુ્
the length of normalized text is: 3
Please have a look at this. I do not know how other contributors have added diacritics; we could also add just the consonants and vowels, but that alone would not make sense.
Let me know your thoughts.
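As a quick check of what NFC does with such a sequence, the sketch below compares a Latin pair that has a precomposed form with the Gujarati sequence from the example. The variable names are illustrative only.

```python
import unicodedata

# 'e' + COMBINING ACUTE ACCENT: a precomposed form (U+00E9) exists.
latin = "e\u0301"
# Gujarati PHA + VOWEL SIGN U + VIRAMA: the sequence from the example.
gujarati = "\u0AAB\u0AC1\u0ACD"

latin_nfc = unicodedata.normalize("NFC", latin)
gujarati_nfc = unicodedata.normalize("NFC", gujarati)

print(len(latin), "->", len(latin_nfc))        # 2 -> 1
print(len(gujarati), "->", len(gujarati_nfc))  # 3 -> 3
```

Under this check NFC shortens the Latin pair but leaves the Gujarati sequence at 3 code points, so normalization alone does not give one symbol per visible cluster here.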
@sarjil77 Take a look here; it should be enough to copy-paste these changes: main...felixdittrich92:doctr:gujarati-vocab-test. I tested with your "full" vocab and can reproduce it completely, so all chars are added and a string can be encoded char by char :) Before that, please pull the latest changes from main and rebase your branch :)
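Encoding "char by char" against a deduplicated vocab could look roughly like the sketch below. `encode_chars` is a hypothetical helper written for illustration, not docTR's own encoding function, and the tiny vocab is a placeholder.

```python
def encode_chars(text, vocab):
    """Map every code point of `text` to its index in `vocab`."""
    index = {ch: i for i, ch in enumerate(vocab)}
    missing = [ch for ch in text if ch not in index]
    if missing:
        raise ValueError(f"characters not in vocab: {missing!r}")
    return [index[ch] for ch in text]


# Placeholder vocab: KA, KHA, vowel sign AA, vowel sign AU.
vocab = "\u0A95\u0A96\u0ABE\u0ACC"

# KA + vowel sign AU is one visible cluster but encodes to two indices.
encoded = encode_chars("\u0A95\u0ACC", vocab)
print(encoded)  # [0, 3]
```

With base letters and diacritics stored as separate vocab entries, every code point of a label has an index, at the cost of one visible cluster mapping to several symbols.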
@felixdittrich92 The Gujarati letters in your commits cover a significant portion of the Gujarati alphabet, but the set is not complete: 2 major vowels and 3 consonants are missing, so 5 letters in total. Based on the diacritics I provided before, you are right, but these 5 letters are additional ones that don't take diacritics, so I missed them (sorry). They are frequently used, so we can't ignore them. I would also strongly recommend keeping vowels and consonants separate; it is better for traceability and may help in the future.
Then I would say feel free to add the missing ones. What you see is your vocab, but deduplicated :)
@sarjil77 Please don't forget to rebase first. In the meantime, I added a test case that checks the VOCABS entry values for duplicates :)
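A duplicate check of that kind might be sketched as follows. This is an assumption about its shape, not the test that was actually added: `VOCABS_SAMPLE` and `find_duplicates` are hypothetical stand-ins, and the `broken` entry exists only to show a failure.

```python
from collections import Counter

# Stand-in for a vocab mapping; only the structure (name -> string) matters.
VOCABS_SAMPLE = {
    "digits": "0123456789",
    # Gujarati digits U+0AE6..U+0AEF
    "gujarati_digits": "".join(chr(cp) for cp in range(0x0AE6, 0x0AF0)),
    "broken": "aab",  # deliberately contains a duplicate
}


def find_duplicates(vocabs):
    """Return {entry name: sorted repeated characters} for offending entries."""
    report = {}
    for name, chars in vocabs.items():
        repeated = sorted(ch for ch, n in Counter(chars).items() if n > 1)
        if repeated:
            report[name] = repeated
    return report


print(find_duplicates(VOCABS_SAMPLE))  # {'broken': ['a']}
```

In a test suite the same helper could back a single assertion that the report is empty for the real mapping.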
@felixdittrich92 I think it's done from my side :) haa, thanks.
@sarjil77 Your branch still needs to be rebased (there is a conflicting file) :)
Additionally, the docs entry is missing; take a look at the branch I provided :)
I think it is good now. :)
@sarjil77 It's still not rebased on main :)
As for the documentation entry: if you have added more chars, I think the number and the char string have changed as well ;) so please fix this :D
changes with vocab and documentation dec 10
Bumps the github-actions group with 1 update: [JamesIves/github-pages-deploy-action](https://github.com/jamesives/github-pages-deploy-action).
Updates `JamesIves/github-pages-deploy-action` from 4.7.1 to 4.7.2
- [Release notes](https://github.com/jamesives/github-pages-deploy-action/releases)
- [Commits](JamesIves/github-pages-deploy-action@v4.7.1...v4.7.2)
---
updated-dependencies:
- dependency-name: JamesIves/github-pages-deploy-action
  dependency-type: direct:production
  update-type: version-update:semver-patch
  dependency-group: github-actions
...
Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
updating both vocab and documentation
Hey @felixdittrich92, I think I have changed both the documentation and the vocab. Can you please look into it? From my side, I have checked for conflicts. :)
@sarjil77 It looks like you merged the main branch into your feature branch instead of rebasing your branch on main 😅 2 options:
👍🏼
Okay, I am closing this PR and will do as you suggested.
Don't worry, these are little things you will grow from, believe me :) Without getting things wrong we wouldn't learn ^^
here i am adding the gujarati vocabulary